Abstract: This paper establishes the state of the art in both
deterministic and randomized online permutation routing in the POPS
network. We show that any permutation can be routed online on a
POPS(d,g) network either with $O(\frac{d}{g}\log g)$ deterministic slots or,
with high probability, with $5c\lceil d/g\rceil + o(d/g) + O(\log\log g)$
randomized slots, where the constant $c = \exp(1 + e^{-1}) \approx 3.927$.
When $d = \Theta(g)$, which we claim to be the "interesting" case, the
randomized algorithm is exponentially faster than any other algorithm
in the literature, deterministic or randomized. This is true in
practice as well: experiments show that it outperforms its rivals even
on a network as small as a POPS(2,2), and the gap grows exponentially
with the size of the network. We also show that, under suitable
hypotheses, no deterministic algorithm can asymptotically match its
performance.

Index Terms — Optical interconnections, partitioned optical passive star 
network, permutation routing. 



I. Introduction 

The ever-growing demand for fast interconnections in multiprocessor
systems has fostered great interest in optical technology. All-optical
communication benefits from a number of good characteristics
such as no opto-electronic conversion, high noise immunity, and 
low latency. Optical technology can provide an enormous amount 
of bandwidth and, most probably, will have an important role in the 
future of distributed and parallel computing systems. 

The Partitioned Optical Passive Stars (POPS) network [1], [2], [3], 
[4] is a SIMD parallel architecture that uses a fast optical network 
composed of multiple Optical Passive Star (OPS) couplers. A $d \times d$
OPS coupler is an all-optical passive device that can receive an
optical signal from one of its d sources and broadcast it to all of
its d destinations. The number of processors of the network is
denoted by n, and each processor has a distinct index in
$\{0,\ldots,n-1\}$. The n processors are partitioned into g groups of d
processors, n = dg, in such a way that processor i belongs to group
$\mathrm{group}(i) := \lfloor i/d \rfloor$ (see Figure 1). For each pair of groups
$a, b \in \{0,\ldots,g-1\}$, a coupler c(b,a) is introduced which has all
the d processors of group a as sources and all the d processors of
group b as destinations. During a computational step (also referred
to as a slot), each processor i receives a single message from one
of the g couplers c(group(i),a), $a \in \{0,\ldots,g-1\}$, performs some
local computations, and sends a single message to a subset of the g
couplers c(b,group(i)), $b \in \{0,\ldots,g-1\}$. The couplers are broadcast
devices, so this message can be received by more than one processor 
in the destination groups. In agreement with the literature, in the case 
when multiple messages are sent to the same coupler, we assume that 
no message is delivered. This architecture is denoted by POPS(d,g). 
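
The slot model described above can be sketched in a few lines of Python. This is only an illustration of the stated rules (group membership, one coupler per ordered pair of groups, no delivery on conflict); the function names `group` and `run_slot` and the dictionary-based interface are assumptions made for this sketch, not part of the POPS literature.

```python
from collections import defaultdict

def group(i, d):
    """Group of processor i in a POPS(d, g) network: floor(i / d)."""
    return i // d

def run_slot(d, g, sends, listens):
    """Simulate one slot of a POPS(d, g) network.

    sends:   {source processor: (message, set of destination groups)}
    listens: {receiving processor: source group whose coupler it listens to}
    Returns {receiving processor: message} for the messages actually delivered.
    """
    on_coupler = defaultdict(list)  # (dest group b, source group a) -> messages
    for src, (msg, dest_groups) in sends.items():
        for b in dest_groups:
            assert 0 <= b < g, "destination group out of range"
            on_coupler[(b, group(src, d))].append(msg)
    delivered = {}
    for rcv, a in listens.items():
        msgs = on_coupler.get((group(rcv, d), a), [])
        if len(msgs) == 1:  # two or more messages on a coupler: nothing is delivered
            delivered[rcv] = msgs[0]
    return delivered
```

For instance, on a POPS(3,3), `run_slot(3, 3, {0: ("x", {2})}, {7: 0})` delivers "x" to processor 7 through coupler c(2,0), while two senders of group 0 addressing group 2 in the same slot deliver nothing.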

One of the advantages of a POPS(d,g) network is that its diameter
is one. A packet can be sent from processor i to processor j,
$i \neq j$, in one slot by using coupler c(group(j),group(i)). However,
its bandwidth varies according to g. In a POPS(n,1) network, only
one packet can be sent through the single coupler per slot. At the
other extreme, a POPS(1,n) network is a highly expensive, fully
interconnected optical network using $n^2$ OPS couplers. A one-to-all
communication pattern can also be performed in only one slot in the
following way: processor i (the speaker) sends the packet to all the
couplers c(a,group(i)), $a \in \{0,\ldots,g-1\}$; during the same slot, all
the processors j, $j \in \{0,\ldots,n-1\}$, can receive the packet through
coupler c(group(j),group(i)).

Fig. 1. A POPS(3,3). Processors are shown as circles, while optical passive
stars are shown as boxes. Optical signals flow from left to right. The
processors on the left and the processors on the right are the same objects,
shown twice for the sake of clarity.

The POPS network has been shown to support a number of
nontrivial algorithms. Several common communication patterns are
realized in [3]. Simulation algorithms for the ring, the mesh, and the 
hypercube interconnection networks can be found in [5] and [6]. 
Some reliability issues are analyzed in [7]. Algorithms for data 
sum, prefix sums, consecutive sum, adjacent sum, and several data 
movement operations are also described in [6] and [8]. Later, both 
the algorithms for hypercube simulation and prefix sums have been 
improved in [9]. An algorithm for matrix multiplication is provided 
in [10]. Moreover, [11] shows that POPS networks can be modeled 
by directed and complete stack graphs with loops, and uses this 
formalization to obtain optimal embeddings of rings and de Bruijn 
graphs into POPS networks. 

In [8], Datta and Soundaralakshmi claim that in most practical
POPS(d,g) networks it is likely that d > g. We believe that they
are only partly right. While it is true that systems with $d \ll g$ are
too expensive, it is also true that systems with $d \gg g$ offer too
little parallelism to be worth building. We illustrate our point with an
example. Consider the problem of summing 16n data values on a
POPS(d,g) network with $d = g = \sqrt{n}$. This network has n processors.
Therefore, the algorithm can work as follows: we input 16 data values
per processor, let each processor sum up its 16 data values, and finally
use the algorithm in [8] to get the overall sum. This algorithm
requires 16 steps to input the data values and compute the local
sums, plus $2\log\sqrt{n} = \log n$ slots for computing the final result,
for a total of $16 + \log n$ slots. With the idea of upgrading our system, we
buy an additional 15n processors and build a 16n-processor POPS(d',g')
network with $d' = 16d = 16\sqrt{n}$ and $g' = g = \sqrt{n}$. Now, we can use
just one step to input the data values, one per processor, and then use
the same algorithm in [8] to get the overall sum. Unfortunately, this
algorithm still requires $16 + \log n$ slots, even though we are solving
a problem of the same size using a system with 16 times more
processors!

The problem does not lie in the data sum algorithm of [8]. Essentially
the same thing happens with the prefix sums algorithm in [8], the
simulations in [6], and all the other algorithms in the literature for the
POPS network we know of, including the ones presented in this paper.
The point is that a POPS(d,g) network can exchange at most $g^2$ messages
in a slot. This is an unavoidable bottleneck for networks
where d is much larger than g, resulting in the poor parallelism of
these systems. Also, experience says that the case d = g is the most
interesting from a "mathematical" point of view. In the past literature,
the case d > g and, symmetrically, the case d < g are always dealt with
by reducing them to the case d = g, which usually contains the "core"
of the problem in its purest form. This work is no exception to
this empirical yet general rule. So, it is probably more reasonable
to assume that practical POPS networks will have $d = \Theta(g)$, that is,
d/g and, similarly, g/d bounded by a constant.
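
The $g^2$ bottleneck can be turned into a back-of-the-envelope count. The sketch below is illustrative only: the function name is hypothetical, and it assumes that every packet must cross a coupler at least once and that each of the $g^2$ couplers carries at most one message per slot.

```python
def min_slots_by_bandwidth(d, g, packets=None):
    """Slots needed just to push `packets` messages through the g*g couplers.

    By default every one of the n = d*g processors holds one packet.
    Assumption of this sketch: one message per coupler per slot.
    """
    if packets is None:
        packets = d * g
    couplers = g * g
    return -(-packets // couplers)  # integer ceiling of packets / couplers


# With g = 32 groups: a POPS(32, 32) moves its 1024 packets in at least 1 slot,
# while a POPS(512, 32), with 16 times more processors, needs at least 16.
print(min_slots_by_bandwidth(32, 32), min_slots_by_bandwidth(512, 32))
```

Under these assumptions the count equals $\lceil d/g \rceil$ when every processor holds one packet, which is the factor that appears in the bounds discussed in this paper.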

In any case, finding good algorithms for the case $d \neq g$, both d <
g and d > g, is of great importance, since it is not yet clear what
the optimal tradeoff between d, g, and the cost of the network
is. Furthermore, an optimal tradeoff may not exist in general, since
it probably depends on the specific problem being solved. Moreover,
such algorithms are often nontrivial, as, for example, in [8].
Therefore, we partly accept the claim in [8] that the number of groups
cannot substantially exceed the number of processors per group. So,
throughout the paper, we will discuss our asymptotic results
assuming that g grows and that $d = \Omega(g)$. Nonetheless, we will keep
in mind that the "important" case is likely to be $d = \Theta(g)$.

Here, we consider the permutation routing problem: Each of the 
n processors of the POPS network has a packet that is to be sent to 
another node, and each processor is the destination of exactly one 
packet. This is a fundamental problem in parallel computing and 
interconnection networks, and the literature on this topic is vast. As 
an excellent starting point, the reader can see [12]. On the POPS 
network, this problem has been studied in two different versions: the 
offline and the online permutation routing problem. In the former, 
the permutation to be routed is globally known in the network. 
Therefore, every processor can pre-compute the route for its packet 
taking advantage of this information. This version of the problem 
has been implicitly studied, for particular permutations, in all the 
simulation algorithms we reviewed above. Later, most of these results
have been unified by proving that any permutation can optimally
be routed offline in one slot when d = 1, and in $2\lceil d/g\rceil$ slots when
d > 1 [13].

In the online version, every processor knows only the destination
of the packet it stores. This problem has been attacked in [8]. The
solution iteratively makes use of a sub-routine that sorts $g^2$ items
in POPS(g,g) subnetworks of the larger POPS(d,g) network. The
sub-routine is built by hypercube simulation starting from either
Cypher and Plaxton's $O(\log n\log\log n)$ sorting algorithm for the
n-processor hypercube or from Leighton's implementation [12] on
the n-processor hypercube of Batcher's odd-even merge sort
algorithm [14]. In the first case, Datta and Soundaralakshmi get the
asymptotically fastest algorithm for routing in the POPS network,
running in $O(\frac{d}{g}\log g\log\log g)$ slots. In the second, they get an
algorithm that turns out to be the fastest in practice, running in
$\frac{8d}{g}\log^2 g + \frac{21d}{g} + 3\log g + 7$ slots. Recently, and independently of this
work, Rajasekaran and Davila have presented a randomized algorithm
for online permutation routing that runs in $O(\frac{d}{g} + \log g)$ slots [15].

Our contribution is both theoretical and practical. We show that
any permutation can be routed on a POPS(d,g) network either
with $O(\frac{d}{g}\log g)$ deterministic slots or, with high probability, with
$5c\lceil d/g\rceil + o(d/g) + O(\log\log g)$ randomized slots, where the constant
$c = \exp(1 + e^{-1}) \approx 3.927$. The deterministic algorithm is based on a direct
simulation of the AKS network, and it is the first that requires only
$O(\frac{d}{g}\log g)$ slots. When $d = \Theta(g)$, which we claim to be the "interesting"
case, the randomized algorithm is exponentially faster than any other
algorithm in the literature, deterministic or randomized.
This is true in practice as well: our experiments show that
it outperforms its rivals even on a network as small as
a POPS(2,2), and the gap grows exponentially with the size of
the network. We can also show that, under suitable hypotheses, no
deterministic algorithm can asymptotically match its performance.
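
As a quick numerical check of the constant, the following sketch evaluates $c = \exp(1 + e^{-1})$ and the leading term $5c\lceil d/g\rceil$ of the randomized bound. The function name is hypothetical, and the lower-order terms $o(d/g) + O(\log\log g)$ are deliberately left out, so the printed values are not slot counts of the actual algorithm.

```python
import math

# The constant from the randomized bound: c = exp(1 + 1/e).
c = math.exp(1 + math.exp(-1))

def randomized_leading_term(d, g):
    """Leading term 5 * c * ceil(d / g) only; lower-order terms are ignored."""
    return 5 * c * math.ceil(d / g)

print(round(c, 3))                                # close to 3.927
print(round(randomized_leading_term(64, 64), 2))  # the case d = g
```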

This paper also presents a strong separation theorem between
determinism and randomization. We build a meaningful and natural
problem inspired by permutation routing in the POPS network such
that there exists an $O(\log\log g)$-slot randomized solution, and such
that every deterministic solution needs $\Omega(\log g)$ slots, which
is exponentially slower. To the best of our knowledge, this is the first
strong separation result from $\log g$ to $\log\log g$, and, quite interestingly,
it does not make use of the notion of oblivious routing, which we show
to be essentially off target in the context of routing in the POPS
network.

II. A Deterministic Algorithm 

Let $N_m := \{0,1,\ldots,m-1\}$ denote the set of the first m natural
numbers. In the online permutation routing problem we are given n
packets, one per processor. Packet $p_i$, $i \in N_n$, originates at processor i,
the source processor, and has processor $\pi(i)$ as destination, where $\pi$
is a permutation of $N_n$.

The problem is to route all the packets to their destinations in as few
slots as possible. Crucially, permutation $\pi$ is not known in advance:
at the beginning of the computation, each processor knows only the
destination of the packet it stores.

A. The Upper Bound 

So far, the best deterministic algorithm for online permutation
routing on the POPS(d,g) network is presented in [8]. The algorithm
runs in $O(\frac{d}{g}\log^2 g)$ slots. The computational bottleneck is an $O(\log^2 g)$
sorting sub-routine that sorts $g^2$ data values $\lceil d/g\rceil$ times, each on
one of the $\lceil d/g\rceil$ POPS(g,g) sub-networks into which the larger
POPS(d,g) network is partitioned. The idea in [8] is to make each
POPS(g,g) network simulate Leighton's $O(\log^2 n)$ sorting algorithm
for the n-processor hypercube [12], which is, in turn, an implementation
of Batcher's odd-even merge sort. This is carried out by using a
general result due to Sahni [6], showing that every move of a normal
algorithm for the hypercube (where only one dimension is used for
communication at each step) can be simulated with $2\lceil d/g\rceil$ slots on
a POPS network of the same size. Since Leighton's algorithm is
normal, and since the sub-routine is always used on POPS(g,g)
sub-networks, we get a constant factor slow-down.

The algorithm in [8] is fairly good in practice, since hidden
constants are small. However, we are interested in the best asymptotic
result. So, as suggested in [8], we can replace the Leighton
implementation of Batcher's odd-even merge sort with Cypher and
Plaxton's routing algorithm for the hypercube, which is asymptotically
faster (though slower for networks of practical size), since it runs
in $O(\log n\log\log n)$ time [16]. This yields an $O(\frac{d}{g}\log g\log\log g)$-slot
algorithm for permutation routing on the POPS network, which is
a good improvement. Nonetheless, here we do even better. Our
simple key idea is to simulate a fast sorting network directly on
the POPS, instead of going through hypercube simulation. By giving
an improved $O(\log g)$ upper bound for sorting on the POPS network,
we also get an asymptotically faster algorithm for online permutation
routing.

A comparator [i : j], $i, j \in N_n$, sorts the i-th and j-th elements of
a data sequence into non-decreasing order. A comparator stage is a
composition of comparators $[i_1 : j_1] \circ \cdots \circ [i_k : j_k]$ such that all $i_r$ and
$j_s$ are distinct, and a sorting network is a sequence of comparator
stages such that any input sequence of n data elements is sorted into
non-decreasing order. An introduction to sorting networks can be
found in [17]. Crucially, we can show that a POPS(d,g) network can
efficiently simulate any comparator stage.

Theorem 2.1 ([13]): A POPS(d,g) network can route offline any
permutation among the n = dg processors using one slot when d = 1
and $2\lceil d/g\rceil$ slots when d > 1.

Lemma 2.2: A POPS(d,g) network, n = dg, can simulate a comparator
stage in one slot when d = 1, and in $2\lceil d/g\rceil$ slots when
d > 1.

Proof: Let $[i_1 : j_1] \circ \cdots \circ [i_k : j_k]$ be a comparator stage. We define
a function $\pi$ such that $\pi(i_r) = j_r$ and $\pi(j_r) = i_r$ for all r. Since all
$i_r$ are distinct, and so are all $j_s$, $\pi$ can be arbitrarily extended
to a permutation. By Theorem 2.1, $\pi$ can be routed
in one slot when d = 1, and in $2\lceil d/g\rceil$ slots when d > 1. During this
routing, for every r, processor $i_r$ sends its data value to processor $j_r$
and vice versa. Then, processor $i_r$ discards the maximum of the two
data values, while processor $j_r$ discards the minimum. ■
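
The construction in the proof can be sketched as follows. The code is a plain sequential illustration, not a POPS implementation: the offline routing of Theorem 2.1 is abstracted as a single array permutation, the extension of $\pi$ is taken to be the identity on the untouched positions (one of the arbitrary extensions the proof allows), and all function names are hypothetical.

```python
def stage_permutation(stage, n):
    """Extend the pairing i_r <-> j_r to a permutation of range(n).

    Positions not touched by any comparator are fixed points, which is one
    possible choice for the arbitrary extension used in the proof.
    """
    pi = list(range(n))
    for i, j in stage:
        pi[i], pi[j] = j, i
    return pi

def apply_stage(data, stage):
    """Simulate one comparator stage: exchange values along pi, then discard."""
    pi = stage_permutation(stage, len(data))
    received = [None] * len(data)
    for src, dst in enumerate(pi):   # stands in for the offline routing of pi
        received[dst] = data[src]
    out = list(data)
    for i, j in stage:
        out[i] = min(data[i], received[i])   # processor i_r discards the maximum
        out[j] = max(data[j], received[j])   # processor j_r discards the minimum
    return out

def slots_per_stage(d, g):
    """Slot cost of one stage according to Lemma 2.2."""
    return 1 if d == 1 else 2 * -(-d // g)
```

For example, `apply_stage([3, 1, 2, 0], [(0, 1), (2, 3)])` returns `[1, 3, 0, 2]`, and `slots_per_stage(4, 4)` is 2.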

In [18], the AKS sorting network is presented. This network is able to
sort any data sequence with only $O(\log n)$ comparator stages, which
is optimal. By simulating the AKS network on a POPS network using
Lemma 2.2, we easily get the following theorem.

Theorem 2.3: A POPS(g,g) network can sort $g^2$ data values in
$O(\log g)$ slots.
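
To illustrate how a sorting network combines with Lemma 2.2, the sketch below runs a sorting network stage by stage and charges each stage the slot cost of the lemma. Note the substitution: the AKS network is far too intricate for a short example, so odd-even transposition sort is used as a stand-in. It needs n stages rather than $O(\log n)$, so the slot count printed here does not reflect the bound of Theorem 2.3. All names are hypothetical.

```python
def odd_even_transposition_stages(n):
    """n comparator stages of odd-even transposition sort (a stand-in for AKS)."""
    return [[(i, i + 1) for i in range(s % 2, n - 1, 2)] for s in range(n)]

def sort_with_network(data, stages):
    """Apply the comparator stages in order; comparators in a stage are disjoint."""
    out = list(data)
    for stage in stages:
        for i, j in stage:
            if out[i] > out[j]:
                out[i], out[j] = out[j], out[i]
    return out

def slots_for_network(num_stages, d, g):
    """Total slots when each stage costs 1 slot (d = 1) or 2*ceil(d/g) slots."""
    per_stage = 1 if d == 1 else 2 * -(-d // g)
    return num_stages * per_stage


g = 4
data = [9, 3, 14, 0, 7, 12, 5, 1, 15, 8, 2, 11, 6, 13, 4, 10]  # g*g values
stages = odd_even_transposition_stages(len(data))
print(sort_with_network(data, stages) == sorted(data))
print(slots_for_network(len(stages), g, g))
```

The same driver would apply unchanged to any other sorting network given as a list of comparator stages; only the number of stages, and hence the slot count, would differ.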

The above result is the key to improve on the best deterministic 
algorithm for online permutation routing in the literature. 

Corollary 2.4: A POPS(d,g) network can route online any
permutation in $O(\frac{d}{g}\log g)$ slots.

Proof: To get the claim, it is enough to plug the sorting
algorithm of Theorem 2.3 into Stage 1 of the deterministic routing
algorithm proposed in [8]. ■

This algorithm is not very practical. Indeed, it is based on the
AKS network which, in spite of being optimal, is not efficient when
n is small due to very large hidden constants. However, the result
is important from a theoretical point of view for two reasons:
it establishes that, in principle, $O(\frac{d}{g}\log g)$ slots are enough to solve
the online permutation routing problem deterministically; and, when
$d = \Theta(g)$ and under suitable hypotheses, it matches one of the lower
bounds for deterministic algorithms in the next section.

B. A Few Lower Bounds 

Borodin et al. [19] study the extent to which both complex 
hardware and randomization can speed up routing in interconnection 
networks. One of the questions they address is how oblivious routing 
algorithms (in which the possible paths followed by a packet depend 
only on its own source and destination) compare with adaptive rout- 
ing algorithms. Since oblivious routing can usually be implemented 
by using limited hardware resources on each node, it is important 
to understand whether it is worth using the more complex hardware 
required by adaptive routing. Here, we address similar questions. In
the following, our discussion will be limited to the case $d = \Theta(g)$.

Unfortunately, the concept of oblivious routing does not seem to
be useful for POPS networks. Indeed, by adapting the ideas first
used in [20], we can prove that any oblivious deterministic routing
algorithm needs $\Omega(\sqrt{g})$ slots to deliver correctly every permutation.
Moreover, by customizing and slightly adapting the approach developed
in [19] (which makes use of Yao's minimax principle [21]), it is
also possible to show that any oblivious randomized routing algorithm
must use $\Omega(\log g/\log\log g)$ slots on average.

Theorem 2.5: For any POPS(d,g) network, $d = \Theta(g)$, and any
oblivious deterministic routing algorithm, there is a permutation for
which the routing time is $\Omega(\sqrt{g})$ slots.

Proof: We essentially customize the proof in [20] to POPS
networks, with some minor modifications to allow
for passive devices and a few different assumptions.

We assume d = g; the extension to $d = \Theta(g)$ or wider involves no
further ideas, only more technical fuss. Consider the bipartite digraph
D = (V,A) having the set P of processors and the set C of couplers
as color classes and having as arcs in A those pairs (p,c) such that
processor p can send to coupler c plus those pairs (c,p) such that
processor p can listen from coupler c. We have $|P| = n = dg = g^2$
processors and $|C| = g^2 = n$ couplers, $|V| = |P| + |C| = 2n$; all nodes
have in-degree and out-degree both equal to g.

Every oblivious algorithm defines a directed a,b-path, denoted
by (a,b], for every pair $(a,b) \in P^2$, namely, the directed path
of D followed by a packet with origin a and destination b.
The characteristic vector $\chi_{(a,b]}$ of a path (a,b] is defined by
regarding the path as the set of its nodes including b but not
a. The congestion of a family $\Pi$ of directed paths is defined as
$c(\Pi) := \max_{v \in V} \sum_{(a,b] \in \Pi} \chi_{(a,b]}(v)$. It is clear that the congestion of
$\Pi$ gives a lower bound on the number of steps required to move
a packet along each path in $\Pi$, since no processor in P and no
coupler in C can receive more than one distinct packet within
a single slot. To prove the theorem we do the following: with
reference to the path family $\{(a,b] \mid (a,b) \in P^2\}$ determined by the
oblivious algorithm under consideration, we show how to construct
a permutation $\pi : P \to P$ such that $c(\{(a,\pi(a)] \mid a \in P\}) \ge \sqrt{g}/2$.
This will imply the stated lower bound regardless of the queueing
discipline, however omniscient, employed by the algorithm. For every
$b \in P$, let $S_b := \{v \in V \mid \sum_{a \in P \setminus \{b\}} \chi_{(a,b]}(v) \ge \sqrt{g}/2\}$. Clearly, every
path (a,b], $a \notin S_b$, must have a last node not in $S_b$. Moreover, since
$b \in S_b$, the next node on the path (a,b] must be in $S_b$. Let $X_b$ be
the set of these last nodes when a ranges in $P \setminus S_b$. By definition
of $S_b$, no node in $X_b$ can be the last node outside $S_b$ for more
than $\sqrt{g}/2$ such paths, hence $|P \setminus S_b| \le |X_b|(\sqrt{g}/2)$, which implies
$|S_b| \ge \sqrt{g}$ in case $|X_b| \le g\sqrt{g}$. Moreover, $|X_b| \le g|S_b|$ since the
in-degree of the network is bounded by g. This implies $|S_b| \ge \sqrt{g}$ in
the complementary case that $|X_b| > g\sqrt{g}$. In conclusion, $|S_b| \ge \sqrt{g}$
holds for every $b \in P$. Therefore, by an averaging argument, there
must exist a $v \in V$ which belongs to at least $\frac{n\sqrt{g}}{2n} = \frac{\sqrt{g}}{2}$ of these sets
$S_b$, $b \in P$. Let $B = \{b \in P \mid v \in S_b\}$. Let $b_1, b_2, \ldots, b_{\sqrt{g}/2}$ be distinct
processors in B and run the following greedy algorithm where for all
processors p in P the value $\pi(p)$ is initially undefined.

For i := 1 to $\sqrt{g}/2$, let a be any processor in $S_{b_i}$ such that
$\pi(a)$ is undefined and define $\pi(a) := b_i$.

Notice that such an a can be found at each step $i \le \sqrt{g}/2$ since at
step i at most i values of $\pi$ have been defined, while $|S_{b_i}| \ge \sqrt{g}$.
Moreover, $\pi$ can clearly be extended to a full permutation, while already
$c(\{(a,\pi(a)] \mid \pi(a) \text{ is defined}\}) \ge |\{a \mid \pi(a) \text{ is defined}\}| = \sqrt{g}/2$ since
node v belongs to each path $(a,\pi(a)]$ by construction. ■
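
The congestion $c(\Pi)$ used in the proof is easy to compute for an explicit family of paths. The sketch below is a direct transcription of the definition; the node labels in the example are invented for illustration and do not come from any particular routing algorithm.

```python
from collections import Counter

def congestion(paths):
    """Congestion of a family of directed paths.

    Each path is a sequence of nodes [a, ..., b]. Following the definition of
    the characteristic vector, the origin a is not counted and the
    destination b is.
    """
    load = Counter()
    for path in paths:
        load.update(set(path[1:]))   # every node of (a, b] counts once per path
    return max(load.values(), default=0)


# Three processor -> coupler -> processor paths; two of them share coupler "c0".
paths = [["p0", "c0", "p3"], ["p1", "c0", "p4"], ["p2", "c1", "p5"]]
print(congestion(paths))
```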

Theorem 2.6: For any POPS(d,g) network, $d = \Theta(g)$, and any
oblivious deterministic routing algorithm, the expected routing time
for a random permutation (with each permutation chosen with uniform
probability) is $\Omega(\log g/\log\log g)$.

Proof: The proofs to be customized and adapted here come
from [19]. The customization starts again by considering the bipartite
digraph D = (V,A) on color classes P and C introduced in the proof
of Theorem 2.5. The various small adjustments are also analogous
to those detailed in the proof of Theorem 2.5. ■

Corollary 2.7: For any POPS(d,g) network, $d = \Theta(g)$, and any
oblivious randomized routing algorithm, there is a permutation for
which the expected routing time is $\Omega(\log g/\log\log g)$.

Proof: To get this corollary of Theorem 2.6, use Yao's
minimax principle [21] in perfect analogy to what is done in [19]. ■

These complexities are not satisfactory. Indeed, here in this paper 
we show a non-oblivious deterministic algorithm that runs in O(logg) 
slots and a non-oblivious randomized one that runs in O(loglogg) 
slots with high probability. So, by restricting to oblivious algorithms, 
it may be true that we get a (somewhat) simpler processor, but we 
also lose an exponential factor in running time, both with and without 
randomization. This is not a good deal. Therefore, we will not discuss 
oblivious routing any more, and will focus only on adaptive routing. 

Finding good lower bounds for adaptive deterministic routing is
not trivial. In [19], the authors explicitly say that they were not able
to provide any result for this case in their context. Here, we give
partial answers. First, we prove a tight $\Omega(\log g)$ lower bound for a
special case of adaptive deterministic routing that applies both to the
hypercube simulation routing algorithm in [8] and to our deterministic
algorithm (which is, in this context, optimal). Second, we prove a
strong separation theorem between determinism and randomization.
Indeed, we can show both an $\Omega(\log g)$ lower bound for a class of
adaptive deterministic routing algorithms, and an $O(\log\log g)$ upper
bound for the same class where processors are allowed to generate
and use randomness. To the best of our knowledge, this is the first
separation theorem showing a gap between $\log n$ and $\log\log n$.

Consider our deterministic routing algorithm, proposed in the 
previous section. It is based on a simulation of the AKS sorting 
network. At every slot, each processor sends its packet to a pre- 
determined other processor, according to the comparator it is going 
to simulate in the slot. So, the communication patterns are fixed for 
the whole computation, and do not depend on the input permutation. 
We can prove a lower bound for all algorithms that have the same
property. More formally, a routing algorithm is called rigid if, at every
slot t, each processor i sends one of the packets it currently stores
to the set of groups $C_{out}(i,t)$, and listens to group $c_{in}(i,t)$, where
functions $C_{out}$ and $c_{in}$ depend solely on t and on the processor index.
Here, we can assume that the processors have enough local memory
to store a copy of all the packets they have seen so far and that they
choose the packet to send according to any strategy or algorithm.
This is enough to get the following lower bound.

Theorem 2.8: Any deterministic and rigid algorithm for online
permutation routing on the POPS(d,g) network, $d = \Theta(g)$, must use
$\Omega(\log n)$ slots.

Proof: Consider a processor i. Let P(i,t) be the set of all
packets that are potentially stored by processor i at slot t, according
to the routing algorithm. At the beginning, $P(i,0) = \{p_i\}$. During
slot t, processor i can receive at most one packet from group $c_{in}(i,t)$.
Assume this packet comes from processor j. Index j is statically
determined and is independent of the input permutation, since the
algorithm is rigid. So, either $P(i,t) = P(i,t-1) \cup P(j,t-1)$ or
$P(i,t) = P(i,t-1)$, if no packet is received from group $c_{in}(i,t)$ (because there
is no such processor j, or a conflict occurred). Therefore, $|P(i,t)| \le 2^t$
for all $t \ge 0$.

Now, assume that the algorithm stops after $t < \log n$ slots. Then,
$|P(i,t)| < n$, and there exists h such that $p_h \notin P(i,t)$. As a
consequence, the routing algorithm must fail for all input permutations
such that the destination of $p_h$ is processor i. We conclude that
$t = \Omega(\log n)$. ■
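
The doubling argument can be checked on a toy model. The sketch below tracks the sets P(i,t) under a fixed listening schedule and ignores groups, couplers, and conflicts entirely, which is the most favorable case for the algorithm. The hypercube-like schedule is only an example of a fixed schedule chosen because it makes the bound $|P(i,t)| \le 2^t$ tight; it is an assumption of this sketch, not a schedule taken from the paper.

```python
def knowledge_growth(n, slots, sender):
    """Largest |P(i, t)| for t = 0..slots under a fixed (rigid) schedule.

    sender(i, t) returns the processor that i hears from at slot t, or None.
    Conflicts are ignored, so every scheduled reception succeeds.
    """
    known = [{i} for i in range(n)]          # P(i, 0) = {p_i}
    sizes = [1]
    for t in range(1, slots + 1):
        heard = [sender(i, t) for i in range(n)]
        # Synchronous update: all unions use the sets from slot t - 1.
        known = [known[i] | known[j] if j is not None else known[i]
                 for i, j in enumerate(heard)]
        sizes.append(max(len(k) for k in known))
    return sizes

def hypercube_sender(i, t):
    """Example schedule: at slot t, listen to the neighbor across bit t - 1."""
    return i ^ (1 << (t - 1))


print(knowledge_growth(16, 4, hypercube_sender))
```

With n = 16 the sizes double at every slot, so no processor can have seen all packets before slot $\log n = 4$.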



This bound applies to both the $O(\log^2 g)$ algorithm in [8] and our
deterministic algorithm in the previous section. Therefore, within the
class of rigid algorithms, our proposed routing scheme is optimal.

Now, we prove a strong separation theorem. Under restricted 
hypotheses, we can show that randomization can give an exponential 
speed-up over determinism. Here, we address a class of routing 
algorithms we call two-hops algorithms. A two-hops algorithm has 
the following properties: 

1) every processor has two buffers, an A-buffer and a B-buffer;

2) at the beginning, the packets are stored in the A-buffer of each
processor;

3) at every odd slot 2t + 1, t = 0,1,..., every processor i with a
packet in the A-buffer sends the packet to group $c_{out}(i,2t+1)$
(two-hops algorithms can only use unicast), listens to incoming
packets from group $c_{in}(i,2t+1)$, and stores the incoming packet
(if any) into the B-buffer;

4) at every even slot 2t, t = 1,2,..., every processor i sends the
packet in the B-buffer to its destination, resets the B-buffer, and
listens to incoming packets from coupler $c_{in}(i,2t)$.

Also, we will make the following assumptions: 

5) when multiple packets use the same coupler (multiple packets
from a group sent to the same group), no packet is delivered;

6) when a packet arrives at any processor in the destination group,
it is considered to be successfully routed, and disappears from
the network (from the original A-buffer as well).

The last hypothesis simplifies the job of routing all the packets to
their destinations: we do not have to take care of acks when packets reach
their destination. However, since we are proving a lower bound,
we do not lose generality. Now, our goal is to show that for every
deterministic choice of functions $c_{in}$ and $c_{out}$, there exists an input
permutation such that the routing is completed in $\Omega(\log g)$ slots. On
the other hand, our randomized algorithm shows that there exist a
deterministic $c_{in}$ and a randomized $c_{out}$ such that all the packets are
routed to destination in $O(\log\log g)$ slots with high probability.

Consider a deterministic two-hops algorithm. Assume that the
algorithm stops after $T < \frac{1}{2}\min\{\log d, \log g\}$ slots, T even. We will
say that processor i shoots on group a in the first T slots if there
exists an odd t < T such that $c_{out}(i,t) = a$.

Lemma 2.9: There exists a group $a_0$ such that at most dT processors
shoot on $a_0$ in the first T slots.

Proof: By counting. ■

Corollary 2.10: There are at least $n - dT = dg - dT \ge dg/2$
processors i such that processor i does not shoot on $a_0$ in the first T
slots.

Let $P(a_0)$ be the set of processors i such that processor i does not
shoot on $a_0$ in the first T slots. By Corollary 2.10, $|P(a_0)| \ge dg/2$. A
subset $A \subseteq P(a_0)$ is $\sqrt{g}$-robust if for every $i \in A$ and for every t < T
there are at least $\sqrt{g}$ processors j in A such that $c_{out}(i,t) = c_{out}(j,t)$.

Lemma 2.11: There exists a $\sqrt{g}$-robust subset $P'(a_0) \subseteq P(a_0)$ such
that $|P'(a_0)| \ge \frac{dg}{2} - Tg\sqrt{g}$.

Proof: If $P(a_0)$ is not $\sqrt{g}$-robust, then there must be a
processor $i \in P(a_0)$ and a t < T such that $c_{out}(i,t) = c_{out}(j,t)$ for fewer
than $\sqrt{g}$ processors $j \in P(a_0)$. This means that all the processors j
such that $c_{out}(i,t) = c_{out}(j,t)$ (including i) must be removed from $P(a_0)$
to get a $\sqrt{g}$-robust subset. So, let $P_1(a_0)$ be obtained from $P(a_0)$ by
removing all these processors and mark the pair $(t, c_{out}(i,t))$. Start now
from $P_1(a_0)$ in place of $P(a_0)$ and keep iterating. Notice that no pair
can be marked twice in the process. The number of pairs is at most
Tg, and each time we mark a pair we drop at most $\sqrt{g}$ processors. ■
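
The pruning procedure in the proof of Lemma 2.11 can be written as a short loop. This is a sketch under stated assumptions: the function name and interface are hypothetical, the count of processors sharing a value of $c_{out}$ includes processor i itself, and the pair to mark is picked by `min` only to make the run deterministic (the proof allows any order).

```python
import math
from collections import Counter

def robust_subset(procs, c_out, slots, g):
    """Prune `procs` down to a sqrt(g)-robust subset.

    c_out(i, t) is the group processor i shoots on at slot t; `slots` lists
    the slots t to check. Returns (surviving processors, marked pairs).
    """
    alive = set(procs)
    threshold = math.sqrt(g)
    marked = set()
    while True:
        counts = Counter((t, c_out(i, t)) for i in alive for t in slots)
        bad = [pair for pair, k in counts.items() if k < threshold]
        if not bad:
            return alive, marked
        t, a = min(bad)
        marked.add((t, a))            # a removed pair can never reappear
        alive = {i for i in alive if c_out(i, t) != a}


# Toy instance with g = 16 (threshold 4): at slot 1 only three processors
# shoot on group 1, so they are dropped and the pair (1, 1) is marked.
def toy_c_out(i, t):
    return (0 if i < 17 else 1) if t == 1 else i % 4

alive, marked = robust_subset(range(20), toy_c_out, [1, 3], 16)
print(sorted(alive), marked)
```

Each marked pair removes fewer than $\sqrt{g}$ processors, so the number of removed processors is bounded by the number of marked pairs times $\sqrt{g}$, as in the counting at the end of the proof.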
Theorem 2.12: Any deterministic two-hops algorithm for online
permutation routing on the POPS(d,g) network, $d = \Theta(g)$, must
use $\Omega(\log n)$ slots.

Proof: We will show that for every processor i in $P'(a_0)$ there
exists an input permutation such that $p_i$ will not reach its destination.
The idea of the proof is as follows: we can build an input permutation
such that $p_i$ has to perform two hops to get to its destination, and that
has a conflict at every even slot. Take a packet $p_i$ such that $i \in P'(a_0)$
and mark the packet. Now, for $t := T - 1$ downto 1, t odd, do the
following:

for every marked packet $p_j$,

1) take an unmarked packet $p_h$ such that $c_{out}(h,t) = c_{out}(j,t)$;

2) mark packet $p_h$.

Then, set the destination of all marked packets to processors in
group $a_0$, so that no marked packet can get to its destination in one hop
(they are chosen from $P'(a_0) \subseteq P(a_0)$). The number of packets that
are marked in the above process exceeds neither d nor $\sqrt{g}$, since
$T < \frac{1}{2}\min\{\log d, \log g\}$. The important property guaranteed by the
above process is that any packet $p_h$ marked at time t will experience
a conflict during all even slots from the beginning of the routing
to time t. In particular, packet $p_i$ does not reach its destination within
$T = \Omega(\log n)$ slots. ■

We believe that the $\Omega(\log g)$ lower bound for deterministic routing
holds in a much wider setting. This is described in the following two
conjectures.

Conjecture 2.13: There exists a deterministic algorithm for online
permutation routing on the POPS(d,g) network, $d = \Theta(g)$, that is
optimal and conflict-free.

Conjecture 2.14: Any deterministic and conflict-free algorithm for
online permutation routing on the POPS(d,g) network, $d = \Theta(g)$,
must use $\Omega(\log n)$ slots.

III. A Randomized Algorithm 

Here we present our randomized algorithm. In the following, we 
will make use of the so called union bound, a simple bound on the 
union of events. 

Fact 3.1 (Union Bound): Let $E_1,\ldots,E_m$ be m events. Then,

$$\Pr\left[\bigcup_{i=1}^{m} E_i\right] \le \sum_{i=1}^{m} \Pr[E_i].$$
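
A small exact check of the union bound on a finite, uniform sample space may help fix ideas. The dice example and the function name below are purely illustrative and are not taken from the paper.

```python
from fractions import Fraction
from itertools import product

def union_bound_sides(events, space):
    """Return (Pr[union of events], sum of Pr[event]) under the uniform measure."""
    size = len(space)
    union = set().union(*events)
    lhs = Fraction(len(union), size)
    rhs = sum(Fraction(len(e), size) for e in events)
    return lhs, rhs


# Two fair dice: E1 = first die shows 6, E2 = second die shows 6, E3 = sum is 12.
space = list(product(range(1, 7), repeat=2))
e1 = {w for w in space if w[0] == 6}
e2 = {w for w in space if w[1] == 6}
e3 = {w for w in space if sum(w) == 12}
lhs, rhs = union_bound_sides([e1, e2, e3], space)
print(lhs, rhs, lhs <= rhs)
```

The bound is not tight here because the three events overlap on the outcome (6, 6); it is tight exactly when the events are pairwise disjoint.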

We will use a function $A(x) := x \bmod g$. Moreover, we will say
that some event happens with high probability meaning that the
probability of the event is at least $1 - 1/g^k$ for some positive constant k.

A. The Case d = g 

Given a packet $p_i$, $i \in N_n$, its temporary destination group is
group $A(\pi(i)) = \pi(i) \bmod g$. Note that there are exactly d packets
with temporary destination group a, for all $a \in N_g$. The idea of
the routing algorithm is as follows: each packet is first routed to an
intermediate group chosen independently and uniformly at random, then
to its temporary destination group, and lastly to its final destination.
So, we iterate the following step, composed of five slots:

1) each processor containing a packet p to be routed chooses a 
random intermediate group r (uniformly and independently at 
random over N g ) and sends a copy of packet p to group ;•; 

2) every copy that arrived to the random intermediate group is 
sent to its temporary destination group; 

3) for each copy that arrived to the temporary destination group 
an ack is sent back to the random intermediate group; 

4) for each ack arrived to the random intermediate group, an ack 
is sent back to the source processor which, in turn, deletes the 
original packet; 



5) every copy that arrived to its temporary destination group is 
sent to its destination. 
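The five-slot step above can be sketched in a few lines of Python. This is a minimal sketch, not the simulator used for the experiments: it assumes the case d = g, numbers processors so that processor i sits in group i // g, and models only slots 1 and 2 (the only slots where conflicts can occur, as argued below). The names `one_step` and `route` are hypothetical.

```python
import random
from collections import Counter


def one_step(pending, perm, g, rng):
    """Simulate slots 1 and 2 of one step on a POPS(g, g) network.

    pending -- indices of the packets still to be routed
    perm    -- perm[i] is the destination processor of packet i
    Returns the set of packets that survive both slots.
    """
    order = sorted(pending)
    # Slot 1: each packet picks a random intermediate group r.
    r = {i: rng.randrange(g) for i in order}
    # Packets of the same source group i // g that picked the same r
    # collide on the same coupler and are all stopped.
    slot1 = Counter((i // g, r[i]) for i in order)
    alive = [i for i in order if slot1[(i // g, r[i])] == 1]
    # Slot 2: survivors sitting in group r[i] head for the temporary
    # destination group perm[i] % g; equal (r, temporary group) pairs collide.
    slot2 = Counter((r[i], perm[i] % g) for i in alive)
    return {i for i in alive if slot2[(r[i], perm[i] % g)] == 1}


def route(perm, g, seed=0):
    """Repeat steps until every packet is delivered; return the step count."""
    rng = random.Random(seed)
    pending = set(range(g * g))
    steps = 0
    while pending:
        pending -= one_step(pending, perm, g, rng)
        steps += 1
    return steps
```

Multiplying the value returned by `route` by five gives the number of slots, since every step consists of five slots.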

During the step, there are at most two replicas of the same packet.
One is the original packet, stored in the source processor; the other
is the copy, which tries to go from the source processor to a random
intermediate group, then to its temporary destination group, and
finally to its destination. In slot 4, if the source processor receives an
ack, it can be sure that the copy will be successfully delivered, as
proved in Proposition 3.2, and can safely delete the original packet.
In fact, the original packet gets deleted in slot 4 if and only if, within
the step, the copy gets to destination in slot 5.

In slots 1, 2, and 5, for every group $a$, every processor $i$ in group $a$
is responsible for listening to coupler $c(a, \Lambda(i))$ for the message
possibly coming from group $\Lambda(i)$. This way, every conflict-free
communication successfully completes and no packet is lost. Indeed,
during slots 1 and 2, in every group $a$, $a \in N_g$, the processor with
index $b$ within the group, $b \in N_g$, receives the packet that is possibly
coming from group $b$. In slot 5, every processor $\pi(i)$ that still has to
receive packet $p_i$ receives it, if it arrives, from group $\Lambda(\pi(i))$,
the temporary destination group of packet $p_i$. Slots 3 and 4 behave
differently. Indeed, each ack sent during slot 3 is received by the
same processor that sent the packet in slot 2. Similarly, each ack
sent during slot 4 is received by the same processor that sent the
packet in slot 1.

Clearly, during slots 1 and 2, multiple conflicts on the couplers 
should be expected, and many of the communications may not 
complete. For example, two packets in the same group can choose the 
same random intermediate group during slot 1, or two packets willing 
to go to the same temporary destination group are currently in the 
same random intermediate group during slot 2. On the contrary, slots 
3, 4, and 5 do not generate any conflict, as shown in the following 
proposition. 

Proposition 3.2: At all steps, slots 3, 4, and 5 of the routing 
algorithm do not generate any conflict. 

Proof: Consider packet $p_i$ stored at processor $i$ in group $a$.
Assume that, during an arbitrary step, its random intermediate group
is $r(i)$, chosen uniformly at random. In the case when packet $p_i$
survives slot 1 and arrives at its random intermediate group $r(i)$,
we know that coupler $c(r(i),a)$ has been used to send packet $p_i$
only, otherwise a conflict would have stopped the packet. Moreover,
since there is only one processor in group $r(i)$ that is responsible for
receiving packet $p_i$, namely processor $r(i)d + a$, there will be only
one ack message corresponding to packet $p_i$ to be sent in slot 4,
and this ack message is the only one that uses the symmetric coupler
$c(a,r(i))$ during slot 4. In conclusion, slot 4 is conflict-free. A similar
argument shows that slot 3 is conflict-free as well.

Consider now slot 5. Assume that, after slot 4, packet $p_j$ has
arrived at the same temporary destination group as packet $p_i$. This
means that $\Lambda(\pi(i)) = \Lambda(\pi(j))$. That is, $\pi(i) \equiv \pi(j) \pmod g$. In this
case, it is not possible that $\pi(i)$ and $\pi(j)$ are in the same group:
a group consists of $d = g$ consecutive indices, so we would have
$\pi(i) = \pi(j)$, contradicting the fact that $\pi$ is a permutation.
Therefore, packets $p_i$ and $p_j$ go to different groups
from their temporary destination group. In other words, slot 5 is
conflict-free as well. ■
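The slot 5 argument can be checked mechanically. The sketch below assumes d = g and the numbering in which processor x belongs to group x // g; the function name is hypothetical. It verifies that no two packets share both the temporary destination group and the final destination group, which is exactly the condition for slot 5 to be conflict-free.

```python
def slot5_conflict_free(perm, g):
    """Return True if packets that share a temporary destination group
    all have distinct final destination groups (case d = g)."""
    seen = set()
    for dest in perm:
        # (temporary destination group, final destination group)
        key = (dest % g, dest // g)
        if key in seen:
            return False
        seen.add(key)
    return True
```

For a permutation the check always succeeds, because the map from a destination to the pair (dest mod g, dest div g) is injective; it fails only when two packets share a destination.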

By Proposition 3.2, if packet $p_i$ survives the first two slots of a step,
then, in the very same step, it will be routed to its destination, and an
ack will be successfully returned to source processor $i$. When the ack
arrives, the source processor can delete the packet, since it knows it
will be safely stored by the destination processor. Conversely, if no
ack arrives, the packet is not deleted, and the processor tries again to
deliver it in the next step, choosing again a possibly different random
intermediate group.



Fig. 2. Example of randomized routing in a POPS(3,3) network. Packet $p_5$ has destination $\pi(5) = 1$ in group 0. Its temporary destination group is group
$\pi(5) \bmod g = 1$. In this step, the random intermediate group chosen by packet $p_5$ is group 2.



By the above discussion, we can safely concentrate on slots 1
and 2. A useful way to visualize the conflicts in slots 1 and 2 of an
arbitrary step is shown in Figure 3(a). At any given step of the routing
algorithm, let $\pi$ be the restriction of the input permutation to those
packets that have not been successfully routed yet (during previous
steps). We build the graph of conflicts, a bipartite multi-graph $G_\pi$
on node classes $S := N_g$ and $D := N_g$. For every group $a$ and for
each packet $p_i$ in group $a$ and yet to be routed, we introduce an edge
with one endpoint in $a \in S$ and the other endpoint in the temporary
destination group $\Lambda(\pi(i)) \in D$. During slot 1 of the step, every edge
(packet yet to be routed) randomly and uniformly chooses a color
in $N_g$ (the random intermediate group). Clearly, the same packet can
choose different colors in different steps of the routing algorithm.
Now we can exactly characterize the conflicts in the first two slots of
the routing algorithm during step $s$. Packet $p_i$ in group $a$ (represented
by an edge from $a \in S$ to $\Lambda(\pi(i)) \in D$) has a conflict during slot 1
if and only if there is another edge incident to $a \in S$ with the same
random color. Moreover, if we remove all edges relative to packets
that have a conflict in slot 1 (see Figure 3(b)), every remaining packet
$p_i$ has a conflict during slot 2 if and only if there is another remaining
edge incident to $\Lambda(\pi(i)) \in D$ with the same random color. Figure 3(c)
shows which packets of Figure 3(a) survive both slots and are hence
delivered to destination by Proposition 3.2.

Our first result shows that, in case the packets are "sparse" in the 
network, then all the packets can be delivered in a constant number 
of slots with high probability. 

Lemma 3.3: If the maximum degree of the conflict graph is $g^\alpha$
for some constant $\alpha < 1$, then the routing algorithm delivers all
the packets to destination in a constant number of slots with high
probability.

Proof: Since the maximum degree of the conflict graph is $g^\alpha$,
in every group of the POPS network there are at most $g^\alpha$ packets left
to be routed, and every group of the POPS network is the temporary
destination group of at most $g^\alpha$ packets. Let $\beta = 1 - \alpha$. We show that
the probability that all packets get routed to destination within $3/\beta$
steps is at least $1 - c_\beta/g$, where $c_\beta := 2^{3/\beta}$ is a constant depending
only on (the constant) $\beta$. Consider a generic packet $p_i$ in group $a$.
The probability that packet $p_i$ has a conflict in one step is at most
equal to the probability that either one of the packets in group $a$ or
one of the packets with temporary destination group $\Lambda(\pi(i))$ chooses
the same random intermediate group as packet $p_i$. Since at most
$g^\alpha - 1$ other packets are in group $a$, and similarly at most $g^\alpha - 1$
have temporary destination group $\Lambda(\pi(i))$, this probability cannot be
larger than $2g^\alpha/g = 2g^{-\beta}$. Therefore, the probability that the packet
is not routed in each of the $3/\beta$ steps is at most

$$\Big(\frac{2}{g^\beta}\Big)^{3/\beta} = \frac{2^{3/\beta}}{g^3} = \frac{c_\beta}{g^3}.$$

By the union bound, the probability that any of the $g^{1+\alpha} \le g^2$ packets
in the network has not been routed in $3/\beta$ steps is at most $c_\beta/g$. ■
As a matter of fact, the hard part of the job is to reduce the initial
number of $g$ packets in each group in such a way to get a "sparse"
set of remaining packets. We can prove that this is done quickly by
our randomized algorithm by providing sharp bounds on the number
$X$ of packets that are successfully delivered in a step. We define $X$ as
a sum of indicator random variables $Z_i$, where $Z_i$ is equal to 1 if the
$i$-th packet is delivered in this step, and 0 otherwise. It is important to
realize that these random variables are not independent: the event that
one packet has a conflict influences the probability that another packet
has a conflict as well. As a consequence, we cannot use the well-known
Chernoff bound to get sharp estimates of the value of $X$, since
there does not seem to be any way to describe the process as a sum
of independent random variables. So, we need a more sophisticated
mathematical tool. Specifically, we will see that slots 1 and 2 of one
step of the routing algorithm can be modeled by a set of martingales.
Martingale theory is useful to get sharp bounds when the process is
described in terms of not necessarily independent random variables.

For an introduction to martingales, the reader is referred to [22]. 
Also [23], [24], [25], and [26] give a description of martingale theory. 
Here, we give a brief review of the main definitions and theorems 
we will be using in the following. 

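Definition 3.4 ([22]): Given the $\sigma$-field $(\Omega, \mathbb{F})$ with $\mathbb{F} = 2^\Omega$, a
filter is a nested sequence $\mathbb{F}_0 \subseteq \mathbb{F}_1 \subseteq \cdots \subseteq \mathbb{F}_m$ of subsets of $2^\Omega$ such
that

1) $\mathbb{F}_0 = \{\emptyset, \Omega\}$;

2) $\mathbb{F}_m = 2^\Omega$;

3) for $0 \le h \le m$, $(\Omega, \mathbb{F}_h)$ is a $\sigma$-field.

Definition 3.5 ([22]): Let $(\Omega, \mathbb{F}, \Pr)$ be a probability space with a
filter $\mathbb{F}_0, \ldots, \mathbb{F}_m$. Suppose that $Z_0, \ldots, Z_m$ are random variables such
that for all $h \ge 0$, $Z_h$ is $\mathbb{F}_h$-measurable. The sequence $Z_0, \ldots, Z_m$ is a
martingale provided that, for all $h \ge 0$,

$$\mathrm{E}[Z_{h+1} \mid \mathbb{F}_h] = Z_h.$$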

The next tail bound for martingales is similar to the Chernoff bound 
for the sum of Poisson trials. 



Fig. 3. Conflict graph $G_\pi$, where permutation $\pi = [1,5,8,9,3,10,11,14,15,13,0,7,2,6,12,4]$ (consequently, $\Lambda(\pi(\cdot)) = [1,1,0,1,3,2,3,2,3,1,0,3,2,2,0,0]$),
in a POPS(4,4) network. (a) Conflict graph $G_\pi$; (b) conflict graph $G_\pi$, where only packets surviving slot 1 are shown; (c) conflict graph $G_\pi$, where only packets surviving both slot 1 and slot 2 are shown.



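Theorem 3.6 (Azuma's inequality): Let $Z_0, Z_1, \ldots$ be a martingale such that for each $h$,

$$|Z_h - Z_{h-1}| \le c_h,$$

where $c_h$ may depend on $h$. Then, for all $t \ge 0$ and any $\lambda > 0$,

$$\Pr[|Z_t - Z_0| \ge \lambda] \le 2\exp\Big(-\frac{\lambda^2}{2\sum_{h=1}^{t} c_h^2}\Big).$$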

Theorem 3.7: A POPS(g,g) network can route any permutation in
$O(\log\log g)$ slots with high probability.

Proof: Let $G_\pi = (S,D;E)$ be the conflict graph at step $s$ of
the routing algorithm, where $\pi$ is the input permutation restricted to
those packets that still have to be routed at the beginning of step $s$.
Let $d_s$ be the maximum degree of $G_\pi$. So, at step $s$ there are at
most $d_s$ packets left to be routed in every group, and at most $d_s$
packets are willing to go to the same temporary destination group.
Clearly, $d_1 \le d$. We will show that after $O(\log\log g)$ steps the conflict
graph has maximum degree at most $g^{5/6}$. This is enough to prove
this theorem by Lemma 3.3.

Assume we are at step $s$. If $d_s \le g^{5/6}$, then we are done. So, we
can assume that $d_s > g^{5/6}$. Let $S_a$, $a \in S$, be the set of indices of the
packets of group $a$ that still have to be delivered at the beginning of
step $s$. Similarly, let $D_b$, $b \in D$, be the set of indices of the packets
in the whole network that still have to be delivered and that have
group $b$ as temporary destination group. Clearly, $|S_a|$ and $|D_b|$ are
the degrees of nodes $a \in S$ and $b \in D$ in the conflict graph of step $s$.
Therefore, $|S_a| \le d_s$ and $|D_b| \le d_s$ for every $a \in S$ and $b \in D$. For
every packet $p_i$ still to be routed, we define the following indicator
random variable,

$$Z_i^1 = \begin{cases} 1 & \text{if packet } p_i \text{ survives slot 1 in step } s, \\ 0 & \text{otherwise.} \end{cases}$$

Random variable $X_a^1 = \sum_{i \in S_a} Z_i^1$ tells the number of packets from
group $a$ that survive slot 1; random variable $Y_b^1 = \sum_{j \in D_b} Z_j^1$ tells the
number of packets with temporary destination group $b$ that survive
slot 1. Moreover, let random variable $C_i$ be equal to the color chosen
by packet $p_i$ in step $s$.

Clearly, we have nothing to show about the nodes in $G_\pi$ that have
degree smaller than or equal to $g^{5/6}$. So, we define sets $S^+ \subseteq S$ and
$D^+ \subseteq D$, which collect the nodes with degree larger than $g^{5/6}$, and
focus on the nodes in these sets. Consider an arbitrary node $a \in S^+$.
The expectation of $Z_i^1$, $i \in S_a$, can be bounded as follows:

$$\mathrm{E}[Z_i^1] = \Pr[\forall h \in S_a \setminus \{i\},\ C_h \ne C_i] = \prod_{h \in S_a \setminus \{i\}} \Pr[C_h \ne C_i] = \Big(1 - \frac{1}{g}\Big)^{|S_a|-1} \ge e^{-|S_a|/g}. \qquad (1)$$

So, the expected number of packets in group $a$ that survive slot 1
can be bounded accordingly,

$$\mathrm{E}[X_a^1] = \mathrm{E}\Big[\sum_{i \in S_a} Z_i^1\Big] = \sum_{i \in S_a} \mathrm{E}[Z_i^1] \ge |S_a|\, e^{-|S_a|/g}. \qquad (2)$$



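In order to show that random variable $X_a^1$ is not far from its
expectation with high probability, we now define random variables
$W_h = \mathrm{E}[X_a^1 \mid \mathbb{F}_h]$, $h = 0, \ldots, |S_a|$, where $\mathbb{F}_h$ is the $\sigma$-field generated
by the random color chosen by the first $h$ packets in $S_a$. Filter $\mathbb{F}_h$,
$h = 0, \ldots, |S_a|$, is such that $W_0, \ldots, W_{|S_a|}$ is a martingale and that
$|W_h - W_{h-1}| \le 2$, since fixing the random color chosen by the $h$-th
packet in $S_a$ can only affect the expected value of the sum $X_a^1$ by
at most two. By Azuma's inequality, for every $\delta > 0$,

$$\Pr\big[|X_a^1 - \mathrm{E}[X_a^1]| > \delta\,\mathrm{E}[X_a^1]\big] = \Pr\big[|W_{|S_a|} - W_0| > \delta\,\mathrm{E}[X_a^1]\big] \le 2\exp\Big(-\frac{(\delta\,\mathrm{E}[X_a^1])^2}{2|S_a| \cdot 2^2}\Big) \le 2\exp\Big(-\frac{\delta^2 |S_a|^2 e^{-2d_s/g}}{8|S_a|}\Big) \le 2\exp\Big(-\frac{\delta^2 g^{5/6}}{8e^2}\Big). \qquad (3)$$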



To prove a similar result for $Y_b^1$, $b \in D^+$, we must recast the above
general martingale arguments into a more structured approach. This
is because $Y_b^1$ may depend on the random colors chosen by all the
packets in the network, and not only on those chosen by the packets
in $D_b$.

Consider an arbitrary node $b \in D^+$. In the following analysis of
the expectation and concentration of $Y_b^1$ we can clearly pretend that
the random colors are first chosen for the packets outside $D_b$, and
later for the packets in $D_b$. This will not invalidate our conclusions
about the whole of the $Y_b^1$'s, $b \in D^+$, since these will be derived from
the solid claims about any single $Y_b^1$ by the union bound. For every
$a \in S$, we define set $\bar{C}_{a,b}$ as $N_g \setminus C_{a,b}$, where $C_{a,b}$ is the set of colors
that are chosen in step $s$ by a packet in group $a$ that has temporary
destination group different from $b$,

$$\bar{C}_{a,b} = N_g \setminus \bigcup_{i \in S_a \setminus D_b} \{C_i\}.$$

The average size of $\bar{C}_{a,b}$ is

$$\mathrm{E}[|\bar{C}_{a,b}|] = g\Big(1 - \frac{1}{g}\Big)^{|S_a \setminus D_b|}.$$

Since this is just a classical balls-and-bins problem [22], we know that
random variable $|\bar{C}_{a,b}|$ is not far from its expectation with probability

_ _ S 2 E[|C a? |] 2 S 2 S 

Pr C fl ,5l < (1 - 5 ) E fl C a,*l] < e ~ < e ~^> 

for every $\delta > 0$. By the union bound over the $g$ nodes in $S$, for every
$\delta > 0$, we know that for every node $a \in S$



with probability 



C a - b \>(l-5)g 1- 



1-ge 



S 2 S 



(4) 



(5) 



Under the hypothesis that Equation (4) holds for every $a \in S$, we
can bound the expectation of $Z_j^1$, $j \in D_b$, as follows:

$$\mathrm{E}[Z_j^1] = \Pr\big[(\forall h \in D_b \cap S_{a_j} \setminus \{j\},\ C_h \ne C_j) \wedge (C_j \in \bar{C}_{a_j,b})\big],$$
where $a_j$ is the group of packet $p_j$. So,



/ i \ \Dtns 1 \{j}\ f \\KSpi\ 

=( M ) (,-;) fi ' , "">„- i „-'j» 

The expectation of $Y_b^1$ can be bounded accordingly,

$$\mathrm{E}[Y_b^1] = \sum_{j \in D_b} \mathrm{E}[Z_j^1] \ge (1-\delta)\,|D_b|\, e^{-|D_b|/g}. \qquad (6)$$



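In order to show that random variable $Y_b^1$ is not far from its
expectation with high probability, we now define random variables
$W_k = \mathrm{E}[Y_b^1 \mid \mathbb{F}_k]$, $k = 0, \ldots, |D_b|$, where $\mathbb{F}_k$ is the $\sigma$-field generated
by the random color chosen by the first $k$ packets in $D_b$. Filter $\mathbb{F}_k$,
$k = 0, \ldots, |D_b|$, is such that $W_0, \ldots, W_{|D_b|}$ is a martingale and that
$|W_k - W_{k-1}| \le 2$, since fixing the random color chosen by the $k$-th
packet in $D_b$ can only affect the expected value of the sum $Y_b^1$ by
at most two. By Azuma's inequality, for every $\delta > 0$,

$$\Pr\big[|Y_b^1 - \mathrm{E}[Y_b^1]| > \delta\,\mathrm{E}[Y_b^1]\big] = \Pr\big[|W_{|D_b|} - W_0| > \delta\,\mathrm{E}[Y_b^1]\big] \le 2\exp\Big(-\frac{(\delta\,\mathrm{E}[Y_b^1])^2}{2|D_b| \cdot 2^2}\Big) \le 2\exp\Big(-\frac{\delta^2 (1-\delta)^2 |D_b|^2 e^{-2d_s/g}}{8|D_b|}\Big) \le 2\exp\Big(-\frac{\delta^2 (1-\delta)^4 g^{5/6}}{8e^2}\Big). \qquad (7)$$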



Let $G_{\pi'} = (S, D; E')$ be the conflict graph at step $s$, where $\pi'$ is
the input permutation restricted to those packets that survive slot 1
in step $s$. Hence, $E' \subseteq E$. Our goal is to bound the number of packets
that survive slot 2 as well, and are thus delivered to destination during
this step. Let $Z_j^2$ be equal to one if packet $p_j$ survives both slots 1
and 2, and zero otherwise. Also, let $S_a^1$, $a \in S$, be the set of indices
of the packets of group $a$ that have survived slot 1. Similarly, let
$D_b^1$, $b \in D$, be the set of indices of the packets in the whole network
that have survived slot 1 and have group $b$ as temporary destination
group. Clearly, for every $a \in S$, $|S_a^1|$ is equal to $X_a^1$ and is the degree
of node $a$ in $G_{\pi'}$; while for every $b \in D$, $|D_b^1|$ is equal to $Y_b^1$ and is
the degree of node $b$ in $G_{\pi'}$. Random variables

$$X_a^2 = \sum_{j \in S_a^1} Z_j^2,$$

$a \in S$, tell the number of packets in group $a$ that are delivered during
step $s$; similarly, random variables

$$Y_b^2 = \sum_{j \in D_b^1} Z_j^2,$$

$b \in D$, tell the number of packets willing to go to temporary
destination group $b$ that are delivered during step $s$.

Consider an arbitrary node $b \in D^+$. The expected value of $Y_b^2$
depends on permutation $\pi'$. Since we are computing a lower bound
to $Y_b^2$, the worst case is when all packets in $D_b^1$ originate at different
groups. Indeed, if two packets in $D_b^1$ belong to the same $S_a$, we
already know that they have chosen two different colors during step $s$,
and the expectation of $Y_b^2$ is larger. A formal proof of this intuitive
claim can be given, though it is omitted for the sake of brevity.
Assuming that random variable $Y_b^1$ is not far from expectation as
in Equation (7), we can bound the expectation of $Y_b^2$,

$$\mathrm{E}[Y_b^2] = |D_b^1|\Big(1 - \frac{1}{g}\Big)^{|D_b^1|-1} \ge (1-\delta)^2 |D_b|\, e^{-|D_b|/g}\Big(1 - \frac{1}{g}\Big)^{|D_b^1|-1} \ge (1-\delta)^2 |D_b|\, e^{-|D_b|/g - |D_b^1|/g} \ge (1-\delta)^2 |D_b|\, e^{-2d_s/g}. \qquad (8)$$



Just as before, $Y_b^2$ is also not far from its expectation. Martingale
theory can be used again to show that

2 E{Y 2 f 



Pr - E[Y b 2 ] | > SE[}f ]] < 2e < 2e 



a 2 (i-5) 4 i!; V 6 



(9) 



Similarly, by using the same technique that has been used to bound
random variable $Y_b^1$, for every node $a \in S^+$ we can show that

$$\mathrm{E}[X_a^2] \ge (1-\delta)|S_a^1|\Big(1 - \frac{1}{g}\Big)^{|S_a^1|-1} \ge (1-\delta)|S_a^1|\, e^{-|S_a^1|/g} \ge (1-\delta)^2 |S_a|\, e^{-|S_a|/g} e^{-|S_a^1|/g} \ge (1-\delta)^2 |S_a|\, e^{-2d_s/g},$$

and that $X_a^2$ is not far from its expectation



Pr 



< 2e 2 ^( 2 ) 2 < 2e 



— — 



(10) 



(11) 



By Equations (2) through (11) and by the union bound,
the number of packets successfully delivered in step $s$ can be bounded
as follows: For every $\delta > 0$,

$$X_a^2 \ge (1-\delta)^3 |S_a|\, e^{-2d_s/g},$$

$$Y_b^2 \ge (1-\delta)^3 |D_b|\, e^{-2d_s/g},$$

for every $a \in S^+$ and $b \in D^+$, with probability at least

l-< 



a 2 (i-;s) 4 gV6 




(12) 
(13) 

(14) 



Now, we divide our analysis into two phases. Phase 1 is composed
of a constant number of steps and, with high probability, reduces the
maximum degree of the conflict graph from $d_1$ to $gx$ or less, where
$0 < x < 1$ is any fixed constant. Phase 2 follows and reduces the
maximum degree of the conflict graph to $g^{5/6}$ or less in $O(\log\log n)$
steps with high probability.

Let us start from Phase 1. For every step $s$ during Phase 1, $gx <
d_s \le g$. We show that a constant number of steps is enough to make
$d_s$ fall below $gx$ with high probability. For all $a \in S^+$, let us refer to
a step such that

$$X_a^2 \ge \frac{|S_a|\, e^{-2}}{2} \qquad (15)$$

as a lucky step for group $a$. By Equations (12) and (14), where we fix
$\delta$ such that $(1 - \delta)^3 = 1/2$, step $s$ is lucky for every group $a \in S^+$
with probability at least

$$1 - 9g\, e^{-\alpha |S_a|} \ge 1 - 9g\, e^{-\alpha g^{5/6}},$$



where $\alpha$ is a positive constant. Therefore, the number of packets that
remain after step $s$ in group $a \in S^+$ is

$$|S_a| - X_a^2 \le |S_a| - \frac{|S_a|\, e^{-2}}{2} \le d_s\Big(1 - \frac{e^{-2}}{2}\Big) \qquad (16)$$

with high probability. Note that the same bound can be shown for
sets $D_b$, $b \in D^+$, with exactly the same analysis (where an analogous
notion of lucky step refers to a step such that the degree of group $b \in
D$ reduces by at least $|D_b|\, e^{-2}/2$). Therefore, after

$$\gamma := \frac{\log x}{\log(1 - e^{-2}/2)}$$

lucky steps for all the groups the maximum degree of the conflict
graph reduces to $gx$ or less. By the union bound, this happens within
the very first $\gamma$ steps with probability at least

$$1 - 9\gamma g\, e^{-\alpha g^{5/6}}.$$

That is, Phase 1 completes in a constant number of steps with high
probability.

We are now at a generic step $s$ in Phase 2. Our goal is to reduce the
degree of the graph of conflicts to $g^{5/6}$. Let $\lambda_s = d_s/g$. We can assume
that $g^{-1/6} < \lambda_s < x$, and when $\lambda_s$ falls below $g^{-1/6}$ we are done.
This time, let us refer to a step during which at least $(1 - \lambda_s)|S_a|\, e^{-2\lambda_s}$
packets in group $a \in S^+$ are delivered as a lucky step for group $a$.
By Equations (12) and (14), where we take $\delta_s = \lambda_s/3$ (in such a way
that $(1 - \delta_s)^3 \ge (1 - \lambda_s)$), step $s$ is lucky for every group $a \in S^+$ with
probability at least

\-9yge-M 12 , 

where $\beta$ is a positive constant, since $|S_a|\lambda_s^2 \ge g^{5/6}(g^{-1/6})^2 = g^{1/2}$.
So, the number of packets that remain in group $a \in S^+$ after step $s$
is

$$|S_a| - X_a^2 \le |S_a| - (1 - \lambda_s)|S_a|\, e^{-2\lambda_s} \le d_s\big[1 - (1 - \lambda_s)\, e^{-2\lambda_s}\big]$$

with high probability. A similar result can be shown for any group $b \in
D$ such that $|D_b| > g^{5/6}$ with exactly the same analysis. By the union
bound, at the end of step $s$ the degree of the conflict graph is at most

$$d_s\big[1 - (1 - \lambda_s)\, e^{-2\lambda_s}\big]$$

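with high probability. Now, assuming a sequence of lucky steps, we
can set up the following recurrence,

$$\lambda_{s+1} \le \lambda_s\big[1 - (1 - \lambda_s)\, e^{-2\lambda_s}\big] \le \lambda_s\big[1 - (1 - \lambda_s)(1 - 2\lambda_s)\big] = \lambda_s\big[1 - 1 + 3\lambda_s - 2\lambda_s^2\big] \le 3\lambda_s^2.$$

Therefore,

$$\lambda_s \le 3\lambda_{s-1}^2 \le 3\big(3\lambda_{s-2}^2\big)^2 \le \cdots \le 3^{2^{s-\gamma-1}} \lambda_{\gamma+1}^{2^{s-\gamma-1}}.$$

That is,

$$\log_3 \lambda_s \le \log_3\Big(3^{2^{s-\gamma-1}} \lambda_{\gamma+1}^{2^{s-\gamma-1}}\Big) = 2^{s-\gamma-1}\big(1 + \log_3 \lambda_{\gamma+1}\big).$$

Since our first goal is to have $\lambda_s \le g^{-1/6}$, we should find $s$ such that

$$\log_3 \lambda_s \le -\frac{\log_3 g}{6}.$$

We can get this by taking $s$ such that

$$2^{s-\gamma-1}\big(1 + \log_3 \lambda_{\gamma+1}\big) \le -\frac{\log_3 g}{6}.$$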

If we choose the arbitrary constant $x$ of Phase 1 to be strictly smaller
than 1/3, we obtain that $1 + \log_3 \lambda_{\gamma+1}$ is negative, and the above
inequality comes down to $s = O(\log\log g)$. Therefore, by the union
bound over the $s - \gamma - 1$ steps of Phase 2, the whole Phase 2 is made
of lucky steps for all the groups in $S^+$ and $D^+$ with probability at
least



\-9{s-y-\)ge 



1-0 (ge 



loglogg 



+ 1, 



We have shown that, after $s = O(\log\log n)$ steps, the maximum
degree of the conflict graph $G_\pi$ is at most $g^{5/6}$ with high probability.
This is enough to get the claim of our theorem by combining Phase 1
and Phase 2, and then using Lemma 3.3. ■
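To see why Phase 2 takes only a doubly logarithmic number of steps, one can iterate the recurrence $\lambda_{s+1} \le 3\lambda_s^2$ numerically. The sketch below is illustrative only: it assumes every step is lucky, treats the bound as an equality, and starts from a hypothetical Phase 1 constant x = 0.3 (any value strictly below 1/3 works). The function name is not from the paper.

```python
def phase2_steps(g, x=0.3):
    """Count iterations of lam <- 3 * lam**2, starting from lam = x,
    until lam drops to g**(-1/6) or below."""
    assert 0 < x < 1 / 3, "the analysis needs x strictly below 1/3"
    target = g ** (-1.0 / 6.0)
    lam, steps = x, 0
    while lam > target:
        lam = 3.0 * lam * lam   # one lucky step of Phase 2
        steps += 1
    return steps
```

Under these assumptions, squaring g (for example going from 10^6 to 10^12, and then to 10^24) adds only one or two iterations, which is the behaviour an $O(\log\log g)$ bound predicts.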

We remark that all transmissions occurring during slots 3 and 4
are just acks, which require only "empty" messages carrying headers
but no payload. When packets are very long, it may be more
efficient to divide the 5 slots into 2 "short" slots and only 3 "long"
slots, hence profiting from the homogeneity of the operations within
the same slot in our routing algorithm.

Note an important property of our algorithm: processor $i$ requires
enough memory to store at most three packets: one is the original
packet $p_i$, the second is the packet whose destination is processor $i$,
and the third is a copy of another packet as received from group $\Lambda(i)$.
However, if we can assume that packet $p_i$ exits the network the slot
after $p_i$ got to its destination $\pi(i)$, then the requirement on the internal
capacity of processors drops to only 2 packets. Similarly, if we can
assume that the input packets are stored on an external feeding line,
then the internal storage requirement drops to 1.

B. The General Case 

Let us start from the case when $d > g$. A natural approach to solve
the problem is to perform two stages: Stage 1 routes the packets until
the degree of the conflict graph is at most $g$; then Stage 2 uses the
randomized algorithm described in the previous section to route the
remaining packets in $O(\log\log g)$ slots. Since at most $g$ packets can
be moved without conflicts from each group in each slot, $(d-g)/g$
is a simple lower bound to the number of slots used in the first of
the two above mentioned stages. In the following, we will show that
we are only a constant factor away from the lower bound, and that we
can precisely indicate this factor.

Consider a group $a \in N_g$. From this group, there are $d > g$ packets
willing to go to destination. If we let every packet choose a random
destination group and try to reach that group, when $d$ is large (it is
enough that $d = \Omega(g \log g)$) every coupler will have a conflict with
high probability and no packet is delivered. Clearly, this is not what
we want to happen. So, the idea for the first stage of the algorithm is a
small modification of the randomized algorithm: Before participating
in the step, every processor with a packet tosses a coin that says 'yes'
with probability $p$. Only those processors that get a 'yes' are allowed
to participate and send their packet.

In the first step, it is best to choose $p$ equal to $g/d$, in such a
way that $g$ packets are sent in expectation. This value maximizes
the expected number of conflict-free communications, and thus the
number of packets that survive slot 1 and slot 2. Later on, $p$ has to
be iteratively reduced using a fixed law according to the expected
reduction of the number of packets left in each group. When at most
$g$ packets are left in each group with high probability, then we can
set $p$ to one, and so proceed with the same algorithm we propose for
the case when $d = g$.

To understand what is the most efficient law, it is important to
understand what is the expected number of packets that are delivered
in each step of the algorithm. Informally speaking, our hope is that
exactly $g$ packets from each group participate in every step of the first
stage of the algorithm. Under this assumption, we know that approximately
$g e^{-1}$ packets of each group will survive the first slot. At the
beginning of the second slot, these packets are somewhat randomly
scattered in the network (not uniformly at random, unfortunately, as
we know from the previous section). If everything goes just like in
the first slot, and this is far from being obvious since the destination
is not random now and the packets are not distributed uniformly
at random, we can hope that $g \exp\{-(1 + e^{-1})\}$ packets from each
group survive the second slot as well, and are thus safely delivered. If
this is the case, $\exp\{1 + e^{-1}\}((d - g)/g)$ steps are enough to reduce
the number of packets from $d$ to $g$ in expectation. The following
theorem shows that, eventually, what happens is exactly what we can
best hope for. Now, we proceed formally.
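The informal estimate above can be reproduced numerically. The following sketch only evaluates the heuristic, not the formal bound of the theorem; the helper names are hypothetical. It computes the constant $c = \exp(1 + e^{-1})$, the exact expected number of conflict-free packets in slot 1 when exactly $g$ packets of a group participate, and the step count that results if every step delivered exactly $g/c$ packets per group.

```python
import math

C = math.exp(1.0 + math.exp(-1.0))  # the constant c of the text, about 3.927


def slot1_survivors(g):
    """Expected number of conflict-free packets when g packets of one group
    each pick one of g couplers uniformly at random."""
    return g * (1.0 - 1.0 / g) ** (g - 1)


def heuristic_stage1_steps(d, g):
    """Steps needed if every step delivered exactly g / c packets per group."""
    return C * (d - g) / g
```

For large $g$, `slot1_survivors(g) / g` approaches $e^{-1}$, which is the "approximately $g e^{-1}$" figure used in the argument.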

Theorem 3.8: Let $c = \exp(1 + e^{-1}) \approx 3.927$. A POPS(d,g) network
can route any permutation in $5c\lceil d/g \rceil + o(d/g) + O(\log\log g)$
slots with high probability.

Proof: The idea of the algorithm is to use $\lceil (c + \varepsilon(g))(d/g - 1) \rceil$
steps, where $\varepsilon(g) = o(1)$, to reduce the maximum degree of the
conflict graph to at most $g$ with high probability. Since every step
consists of 5 slots, we then get the claim by Theorem 3.7.

Every step $s$, $s = 1, \ldots, \lceil (c + \varepsilon(g))(d/g - 1) \rceil$, is similar to the
standard step of the randomized routing algorithm, with the difference
that, before choosing its random color during slot 1, every packet
independently tosses a coin and participates in the step with probability
$g/d_s$, where

$$d_s := d - (s-1)\frac{g}{c + \varepsilon(g)}.$$

Our claim is that, at the beginning of step $s$, $s = 1, \ldots, \lceil (c + \varepsilon(g))(d/g - 1) \rceil + 1$,
the degree of the conflict graph is at most $d_s$
with high probability. As a consequence, when $s = \lceil (c + \varepsilon(g))(d/g - 1) \rceil + 1$,
we get $d_s \le g$ as desired. The claim is certainly true when $s = 1$.
Assume it is true at the beginning of step $s \le \lceil (c + \varepsilon(g))(d/g - 1) \rceil$.
We show that it is true at the beginning of step $s + 1$ as well.

Let $S_a$, $a \in S$, be the set of indices of the packets in group $a$ that
still have to be delivered at the beginning of step $s$. Similarly, let
$D_b$, $b \in D$, be the set of indices of the packets in the whole network
that still have to be delivered at the beginning of step $s$ and that have
group $b$ as temporary destination group. By hypothesis, $|S_a| \le d_s$ and
$|D_b| \le d_s$ for all $a \in S$ and $b \in D$. Our first goal is to prove that at
the beginning of step $s + 1$ the degree of the conflict graph is at most
$d_{s+1}$ with high probability.

For every packet $p_i$ yet to be routed, let random variable $P_i$ be
equal to 1 if packet $p_i$ participates in step $s$, and 0 otherwise. Random
variable $P_a = \sum_{i \in S_a} P_i$ counts the number of packets in group $a$ that
participate in step $s$. The expectation of $P_a$ can be computed as
follows:

$$\mathrm{E}[P_a] = \sum_{i \in S_a} \mathrm{E}[P_i] = \frac{|S_a|\, g}{d_s}.$$

And, clearly, $\mathrm{E}[P_a] \le g$. Since random variables $P_i$ are independent,
the Chernoff bound [22], [25] (note that in [22] this bound appears
in a different yet stronger form) is enough to claim that for every
$\delta > 0$



Pr 



Pa<(l-S) 



Sal 



<e 



< e~ 



< e~ 



Pr [P a > (1 + 8)g] < e 2 * < e 2 * < e~ 



Let $S'_a$, $a \in S$, be the set of indices of the packets in group $a$
that participate in step $s$. Random variable $P_a$ is thus equal to $|S'_a|$.
Therefore, for every $\delta > 0$,

$$(1-\delta)\frac{|S_a|\, g}{d_s} \le |S'_a| \le (1+\delta)g \qquad (17)$$

with probability at least $1 - 2e^{-\delta^2 g/4}$. Since a similar result holds for
every $a \in S$ and $b \in D$, we also know that for every $\delta > 0$

$$(1-\delta)\frac{|S_a|\, g}{d_s} \le |S'_a| \le (1+\delta)g, \qquad (18)$$

$$(1-\delta)\frac{|D_b|\, g}{d_s} \le |D'_b| \le (1+\delta)g, \qquad (19)$$

hold for every $a \in S$ and $b \in D$, with probability at least

$$1 - 4g\, e^{-\delta^2 g/4}, \qquad (20)$$

by the union bound over the $2g$ nodes of the conflict graph.

Clearly, we have nothing to show about the nodes in the conflict
graph that have degree smaller than or equal to $d_{s+1}$. So, we define
sets $S^+ \subseteq S$ and $D^+ \subseteq D$, which collect the nodes with degree larger
than $d_{s+1}$, and focus on the nodes in these sets. Consider an arbitrary
group $a \in S^+$, and assume that the bounds in Equations (18) and (19) hold
for every $a \in S$ and $b \in D$. Now, we can perform the same analysis
as in the proof of Theorem 3.7. Similarly to Equation (10), we know
that

$$\mathrm{E}[X_a^2] \ge (1-\delta)|S_a^1|\Big(1 - \frac{1}{g}\Big)^{|S_a^1|-1} \ge (1-\delta)|S_a^1|\, e^{-|S_a^1|/g},$$

with high probability. In the next equation, we will use the following
two facts: $x e^{-x/g} \le y e^{-y/g}$ whenever $x \le y \le g$, and $x e^{-x/g}$ has its maximum
at $x = g$. Clearly, $|S_a^1| \le g$ (there are only $g$ couplers from group $a$).
So, we get

E[X fl 2 ]>(l-S)|S>-l^> 

>(l-5) 2 |5^| e -l s «l/« e -l s «l eHS ° l/s /s> 



>(1-S) : 



\S a \g-i- e -\ 



with high probability. By setting $\delta = g^{-1/3}$ in the above equation,
with high probability we get

$$X_a^2 \ge \frac{|S_a|}{d_s} \cdot \frac{g}{c + \varepsilon(g)},$$

where $c = e^{1+e^{-1}}$ and $\varepsilon(g) = o(1)$. Since $X_a^2$ is the number of packets
in group $a$ that are delivered to destination during step $s$, the degree
of group $a$ in the conflict graph at the beginning of step $s + 1$ is

$$|S_a| - X_a^2 \le |S_a| - \frac{|S_a|}{d_s} \cdot \frac{g}{c + \varepsilon(g)} \le d_s - \frac{g}{c + \varepsilon(g)} = d_{s+1}.$$

The same result can be shown for every $a \in S^+$ and $b \in D^+$. By
the union bound over the $\lceil (c + \varepsilon(g))(d/g - 1) \rceil$ steps required, and
over the $2g$ nodes in the conflict graph, and by Equation (20) and
a corresponding version of Equation (14), the degree of the conflict
graph is reduced below $g$ with probability at least



1 



-5 2 (l-5)V /6 /8« 2 



-S 2 g/A 



Note that this is 1 — o(l) as g grows. ■ 
To get a feeling of the performance of our randomized algorithm,
we can set $\varepsilon(g) \approx 0.073$ in the proof of the above theorem, in such a
way that $c + \varepsilon(g) = 4$. The result is claimed in the following corollary.

Corollary 3.9: A POPS(d,g) network can route any permutation 
in $20\frac{d}{g} + O(\log\log g)$ slots with high probability. 
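The constants behind the corollary can be checked with a few lines of Python. This is only a worked arithmetic sketch: the variable names are ours, and the factor 5 is read from the $5c\lceil d/g \rceil$ term of the bound stated in the paper.

```python
import math

# c = exp(1 + 1/e), as defined in the analysis.
c = math.exp(1 + math.exp(-1))

# Choose eps so that c + eps = 4, as done before the corollary.
eps = 4 - c


def leading_term(d: int, g: int) -> float:
    """Leading term 5 * (c + eps) * d / g = 20 * d / g of the slot count."""
    return 5 * (c + eps) * d / g
```

Here `c` evaluates to roughly 3.927 and `eps` to roughly 0.073, so the leading term is 20 d/g.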

IV. Experiments 

Our results in Theorems 3.7 and 3.8 are asymptotic. In principle, 
it could thus be possible that the randomized algorithm does not 
perform well in practice. This is not the case. Experiments show that 
it outperforms the algorithm in [8] even on networks as small as a 
POPS(2,2), and proves to be exponentially faster when d and g grow. 

The algorithm in [8] is claimed to run in $\frac{8d}{g}\log^2 g + \frac{21d}{g} + 
3\log g + 7$ slots. However, the authors make a small mistake when 
saying that Leighton's implementation of the odd-even merge sort 
algorithm is composed of $\log^2 n$ steps. The actual complexity is only 
$\frac{\log n(1+\log n)}{2} \approx 2\log^2 g$ steps. So, the running time of 
the routing algorithm in [8] is $\frac{4d}{g}\log^2 g + \frac{2d}{g}\log g + 
\frac{21d}{g} + 3\log g + 7$ slots, which is smaller, and this is what we 
will use in the following. 
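The two running-time expressions for the algorithm in [8] can be evaluated with the Python sketch below. It assumes base-2 logarithms and the coefficients 8, 21, 3, 7 (claimed) and 4, 2, 21, 3, 7 (corrected); the function names are ours and are not taken from [8].

```python
import math


def claimed_slots(d: int, g: int) -> float:
    """Claimed bound: (8d/g) log^2 g + 21 d/g + 3 log g + 7."""
    lg = math.log2(g)
    return 8 * d / g * lg ** 2 + 21 * d / g + 3 * lg + 7


def corrected_slots(d: int, g: int) -> float:
    """Corrected bound: (4d/g) log^2 g + (2d/g) log g + 21 d/g + 3 log g + 7."""
    lg = math.log2(g)
    return 4 * d / g * lg ** 2 + 2 * d / g * lg + 21 * d / g + 3 * lg + 7
```

Under these assumptions, `corrected_slots(32, 32)` evaluates to 153 and `corrected_slots(64, 16)` to 391, which agree with the column (b) entries of Table I for n = 1,024.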

To perform the experiments, we built a simulator for the POPS 
network. It is written in C++ and simulates the network at a message 
level. That is, for every message in the real network, there is a mes- 
sage in the simulator. Processors (implemented as instances of a class 
Processor) locally take decisions about the next step to perform, 
and couplers (implemented as instances of a class Coupler) locally 
propagate messages or stop them in case of conflicts. 
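The coupler behaviour described above can be sketched as follows. This is a minimal Python illustration, not the C++ simulator used for the experiments: the `Message` fields, the `offer` and `resolve` method names, and the rule that a slot with two or more offered messages delivers nothing are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Message:
    source: int        # index of the sending processor
    destination: int   # index of the final destination processor


@dataclass
class Coupler:
    """One optical passive star coupler, resolved once per slot."""
    offered: List[Message] = field(default_factory=list)

    def offer(self, message: Message) -> None:
        """A processor of the source group sends a message in this slot."""
        self.offered.append(message)

    def resolve(self) -> Optional[Message]:
        """Propagate the message if it is alone; stop everything on conflict."""
        winner = self.offered[0] if len(self.offered) == 1 else None
        self.offered.clear()  # the coupler keeps no state across slots
        return winner
```

A slot is simulated by letting every sending processor call `offer` on its chosen coupler and then calling `resolve` on each coupler.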

Then, we implemented our randomized algorithm in the simulator, 
slot by slot. We have been conservative: no theoretical result is 
taken for granted, and the randomized algorithm is simply simulated 
message by message. Not surprisingly, slots 3, 4, and 5 prove to 
be conflict-free, supporting what is proven in Proposition 3.2. So, 
whenever a copy survives slots 1 and 2 it reaches its final destination, 
and the associated ack successfully gets to the source processor. 
Moreover, three buffers in every processor $i$ (one for packet $p_i$, one 
for packet $p_{\pi^{-1}(i)}$, and the third for floating copies of other 
packets) are enough. 

Figure 4 shows the average over a large number of 
experiments in the case when d = g. The number of processors 
n = dg goes from 4 to 16,777,216. The input permutation is chosen 
uniformly at random from the class of all possible permutations. It 
is clear, from the results shown in the figure, that our algorithm 
is much faster than the algorithm in [8] even in practice. Actually, 
our algorithm outperforms its competitor for all network sizes, hence 
putting aside any possible concern about the hidden constants. The 
performance of our algorithm is so good that it is actually hard to 
appreciate it from Figure 4. Hence, Table I shows the exact numerical 
results. 

Then, we tested our algorithm on POPS networks with d larger 
than g. We performed two sets of experiments, one in which d = 
4g and another in which d = 16g. In both cases, the number of 
processors goes from 4 to 16,777,216. We used the algorithm as 
implemented in Corollary 3.9. Therefore, we expect the routing to 
take $20\frac{d}{g} + O(\log\log g)$ slots, according to our theoretical 
results. In fact, the results that are shown in Table I, Figure 5, and 
Figure 6 show that the hidden constants are very small, and that our 
algorithm dramatically outperforms the best deterministic algorithm 
known in the literature for all network sizes we tested. Finally, 
Table II shows some more details: for each experiment, we report the 
average number of steps, the standard deviation, and the worst case 
over one hundred runs. Note that the standard deviation is extremely 
small (smaller than one); therefore, the performance of our algorithm 
is almost always very close to expectation. 
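The per-experiment statistics mentioned above (mean, standard deviation, and worst case over the runs) can be computed as in the following Python sketch. The helper name `summarize` and the use of the population standard deviation are assumptions; the text does not say which estimator was used, and the sample values in the usage note are made up.

```python
import statistics
from typing import Dict, Sequence


def summarize(step_counts: Sequence[int]) -> Dict[str, float]:
    """Summarize the number of steps observed over repeated runs."""
    return {
        "mean": statistics.mean(step_counts),
        # Population standard deviation; an assumption, see the lead-in.
        "stdev": statistics.pstdev(step_counts),
        "worst": max(step_counts),
    }
```

For instance, `summarize([35, 35, 36, 35, 36])` on these made-up counts gives a mean of 35.4, a worst case of 36, and a standard deviation below one.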
                   d = g             d = 4g            d = 16g
         n      (a)    (b)      (a)     (b)      (a)     (b)
         4    14.75
        16    20.90           71.40
        64                    82.80     177   317.90
       256    30.10           87.15           322.45     669
     1,024    32.50    153    92.60     391   343.10   1,024
     4,096    34.50    202    94.00     546   345.60   1,507
    16,384    35.20    259    94.95     733   339.25   2,118
    65,536    35.55    324    95.15     952   336.45   2,857
   262,144    36.55    397    95.35   1,203   334.30   3,724
 1,048,576    38.25    478    95.65   1,486   333.55   4,719
 4,194,304    39.70    567    96.25   1,801   333.05   5,842
16,777,216    40.05    664    97.05   2,148   333.60   7,093



TABLE I 

Number of slots to route a randomly chosen permutation by 
our randomized algorithm (a) and by the algorithm in [8] (b). 



V. Conclusion 

In this paper, we introduced the fastest algorithms for both 
deterministic and randomized on-line permutation routing. Indeed, we 
have shown that any permutation can be routed on a POPS(d,g) 
network either with $O(\frac{d}{g}\log g)$ deterministic slots, or, with high 
probability, with $5c\lceil d/g \rceil + o(d/g) + O(\log\log g)$ randomized 
slots, where $c = \exp(1 + e^{-1}) \approx 3.927$. The randomized algorithm 
shows that the POPS network is one of the fastest permutation networks ever. 
This can be of practical relevance, since fast switching is one of the 
key technologies to deliver the ever-growing amount of bandwidth 
needed by modern network applications. 
TABLE II 

Number of iterations (mean, standard deviation, and worst case over 
one hundred runs) to route a randomly chosen permutation by our 
randomized algorithm. 